Abstract

We investigate the accuracy of the two most common estimators for the
maximum expected value of a general set of random variables: a
generalization of the maximum sample average, and cross validation. No
unbiased estimator exists and we show that it is non-trivial to select a
good estimator without knowledge about the distributions of the random
variables. We investigate and bound the bias and variance of the
aforementioned estimators and prove consistency. The variance of cross
validation can be significantly reduced, but not without risking a large
bias. The bias and variance of different variants of cross validation are
shown to be very problem-dependent, and a wrong choice can lead to very
inaccurate estimates.



1 Introduction 

We often need to estimate the maximum expected value of a set of random
variables (RVs), when only noisy estimates for each of the variables are given.
For instance, this problem arises in optimization in stochastic decision processes
and in algorithmic evaluation.

Formally, we consider a finite set $V = \{V_1, \ldots, V_M\}$ of $M \geq 2$
independent RVs with finite means $\mu_1, \ldots, \mu_M$ and variances
$\sigma_1^2, \ldots, \sigma_M^2$. We want to find the value of $\mu_*(V)$,
defined by

    $$\mu_*(V) = \max_i \mu_i = \max_i E\{V_i\} . \qquad (1)$$

We assume the distribution of each $V_i$ is unknown, but a set of noisy samples
$X$ is given. The question is how best to use the samples to construct an
estimate $\hat\mu_*(X) \approx \mu_*(V)$. We write $\hat\mu_*$ and $\mu_*$ when
$V$ and $X$ are clear from the context.



Without loss of generality, we assume that we want to maximize rather than minimize. 



It is easy to construct consistent estimators, but we are also interested in the
quality for small sample sizes. The mean squared error (MSE) is the most
common metric for the quality of an estimator, but sometimes (the sign of) the
bias is more important. Unfortunately, as we discuss in Section 2, no unbiased
estimators for $\mu_*$ exist.

A common estimator is the maximum estimator (ME), which constructs
estimates $\hat\mu_i \approx \mu_i$ and then uses $\hat\mu_* = \max_i \hat\mu_i$.
When $X_i \subset X$ contains direct samples for $V_i$, and $\hat\mu_i$ is the
average of $X_i$, the ME is simply the maximum sample average. The ME on average
overestimates $\mu_*$. This bias has been rediscovered several times, for
instance in economics [Van den Steen, 2004] and decision making [Smith and
Winkler, 2006]. It can lead to overestimation of the performance of algorithms
[Varma and Simon, 2006; Cawley and Talbot, 2010], and poor policies in
reinforcement learning algorithms [van Hasselt, 2011b]. It is related to
over-fitting in model selection, selection bias in sample selection [Heckman,
1979] and feature selection [Ambroise and McLachlan, 2002], and the winner's
curse in auctions [Capen et al., 1971].

The most common alternative to avoid this bias is cross validation (CV)
[Larson, 1931; Mosteller and Tukey, 1968]. If CV is used to construct each
$\hat\mu_i$, and thereafter to estimate $\mu_*$ as described in Section 3, this
is called nested CV or "double cross" [Stone, 1974]. Unfortunately, (nested) CV
can lead to a large variance. Perhaps surprisingly, we show the absolute bias of
CV can be larger than the bias of the ME that we are trying to prevent. However,
the bias of CV is provably negative, which can be an advantage.

In this paper, we give general distribution-independent bounds for the bias
and variance of the ME and CV. We present a new variant of CV and show
that which CV estimator is most accurate in terms of MSE depends strongly on
$V$. Therefore, it is non-trivial to construct accurate CV estimators without
some knowledge about the distributions of $V$. We discuss why standard 10-folds
CV is often not a bad choice for model selection, but show that in other settings
other estimators may be much more accurate.

We now discuss two motivating examples to highlight the practical importance
of this general topic.



Learning Algorithms Many learning algorithms explicitly maximize noisy
values to update their internal parameters. For instance, in reinforcement
learning [Sutton and Barto, 1998] the goal is to find strategies that maximize a
reward signal in a (sequential) decision task. Inaccurate biased estimators for
$\mu_*$ can have adverse effects on the speed of learning and the strategies
that are learned [van Hasselt, 2011a].



Evaluation of Algorithms Most machine-learning algorithms have tunable
parameters. Internal parameters, such as the Lagrangian multipliers of a
support vector machine (SVM) [Vapnik, 1995], are optimized by the algorithm.
Hyper-parameters, such as the choice of kernel function in an SVM, are often
tuned manually or chosen with domain knowledge. Other relevant choices by
the experimenter, such as which algorithms to consider and the representation
of the problem, can be summarized as meta-parameters.

Typically, we evaluate a set $C$ of configurations, where each $C_i \in C$
denotes a specific algorithm with specific hyper- and meta-parameters. Often,
each evaluation is noisy, due to (pseudo-)randomness in the algorithms or
inherent randomness in the problem. The performance of $C_i$ is then a random
variable $V_i$, and we want to construct an estimate $\hat\mu_*$ for the best
performance $\mu_*$.^2

For instance, Varma and Simon [2006] note that the ME results in
overly-optimistic prediction errors and propose to use nested CV. They evaluate
an SVM for various hyper-parameters on an artificial problem, with an actual
error of 50%. The estimate by the ME is 41.7% and nested CV results in 54.2%.
Varma and Simon argue that the latter exceeds 50% because nested CV removes
a sample from each training set. However, in fact the difference between 50%
and 54.2% is a demonstration of a completely different general bias that we
discuss in Section 3.2. This bias has received very little attention, although
(as we will show) it is not in general smaller than the bias caused by using
the ME.



Overview In the rest of this section, we discuss related work and (notational)
preliminaries. In Section 2 we discuss the bias of estimators in general. In
Section 3 we discuss the properties of the ME and of CV, including bounds on
their bias and variance. In Section 4 we discuss concrete settings with empirical
illustrations. Section 5 contains a discussion and Section 6 concludes the paper.



1.1 Related Work 



The bootstrap [Efron and Tibshirani, 1993] is a resampling method that can be
used to estimate the bias of an estimator, in order to reduce this bias. Based
on this, Tibshirani and Tibshirani [2009] propose an estimator for $\mu_*$ for
model selection in classification. Inevitably (see Section 2) the resulting
estimate is still biased, and it is typically more variable than the original
estimate. Also specifically considering model selection for classifiers,
Bernau et al. [2011] propose a smoothed version of nested CV. The resulting
estimator performs similarly to normal (nested) CV, which in turn is shown to
typically be more accurate than the approach by Tibshirani and Tibshirani. In
this paper we focus on CV and the ME, which are by far the most widely used.

The problem of estimating $\mu_*$ is related to the multi-armed bandit framework
[Robbins, 1952; Berry and Fristedt, 1985], where the goal is to find the identity
of the action with the maximum expected value rather than its value. The
focus in the literature on multi-armed bandits is often on how best to collect
samples. In contrast, in this paper we assume that a set of samples is given. A
discussion on how best to collect samples to minimize the bias or MSE is outside
the scope of this paper, although we do note that minimizing online regret
[Lai and Robbins, 1985; Auer et al., 2002] does not necessarily correspond to
minimizing the online MSE of the estimator.

^2 Sometimes we are more interested in the configuration that optimizes the
performance than in the resulting performance, but often the performance itself
is at least as important. In part, this depends on whether the focus of the
research is on the algorithms or on the problem.



1.2 Preliminaries 

The measurable domain of $V_i$ is $\mathcal{X}_i$, and
$f_i : \mathcal{X}_i \to \mathbb{R}$ denotes its probability density function
(PDF), such that $\mu_i = \int_{\mathcal{X}_i} x f_i(x)\,dx$. For conciseness,
we assume $\mathcal{X}_i = \mathbb{R}$. We assume these PDFs $f_i$ are unknown
and therefore $\mu_* = \max_i \mu_i$ cannot be found analytically.

We write $\hat\mu_i(X)$ for an estimator for $\mu_i$ based on a sample set $X$.
Similarly, $\hat\mu_*(X)$ is an estimator for $\mu_*$. We write $\hat\mu_i$ and
$\hat\mu_*$ when $X$ is clear from the context. If $X_i \subset X$ is a set of
unbiased samples for $V_i$, $\hat\mu_i$ might be the sample average. In that
case, $\hat\mu_i$ is unbiased for $\mu_i$. In general, $\hat\mu_i$ can be biased
for $\mu_i$. As discussed in the next section, no general unbiased estimator for
$\mu_*$ exists, even if all $\hat\mu_i$ are unbiased.

The following definitions will be useful below, when stating necessary and
sufficient conditions for a strictly positive or a strictly negative bias. The set of
optimal indices for RVs $V$ is defined as

    $$\mathcal{O}(V) = \{ i \mid \mu_i = \mu_* \} . \qquad (2)$$

The set of maximal indices for samples $X$ is defined as

    $$\mathcal{M}(X) = \{ i \mid \hat\mu_i(X) = \max_j \hat\mu_j(X) \} . \qquad (3)$$

An estimator is called optimal or maximal whenever its index is optimal or
maximal, respectively. Note that optimal estimators are not necessarily maximal
and maximal estimators are not necessarily optimal.



2 The Bias of an Estimator 

Let $\mathcal{V}$ be a function space containing all admissible sets of $M$ RVs.
We might know $\mathcal{V}$, but not the precise identity of $V \in \mathcal{V}$.
For instance, $\mathcal{V}$ may be the set of all sets of $M$ normal RVs with
finite moments. Let $p : \mathcal{V} \to \mathbb{R}$ be a PDF over $\mathcal{V}$.
The expected MSE of an estimator $\hat\mu_*$ is equal to

    $$\int_{\mathcal{V}} p(V) \int P(X \mid V) \, (\hat\mu_*(X) - \mu_*)^2 \, dX \, dV . \qquad (4)$$

In any given concrete setting, there is a single unknown set $V$. Therefore, $p$
does not exist 'in the world'. Rather, $p$ might model our prior belief about
which sets $V$ are likely in a given setting, or it might specify the $V$ for
which we would like an estimator to perform well. The MSE consists of variance
and bias. To reason in some generality about which estimators are good in
practice, we discuss the non-existence of unbiased estimators and the direction
of the bias.

Non-Existence of Unbiased Estimators By definition, $\hat\mu_*$ is a general
unbiased estimator (GUE) for $\mathcal{V}$ if and only if

    $$\forall V \in \mathcal{V} : E\{\hat\mu_* \mid V\} = \mu_* . \qquad (5)$$

Unfortunately, for most $\mathcal{V}$ of interest no such estimator exists. For
instance, Blumenthal and Cohen [1968] show no GUE exists for two normal
distributions and Dhariyal et al. [1985] proved this for arbitrary $M \geq 2$
and for more general distributions, including the exponential family.
Essentially, the argument is that a reasonable estimator for $\mu_*$ depends
smoothly on the values of the samples, whereas the real value $\mu_*$ is a
piece-wise linear function with a discontinuous derivative. We cannot know the
location of these discontinuities without knowing the actual maximum.

Note that (5) is already false if $\mathcal{V}$ contains only a single set of
variables for which $\hat\mu_*$ is biased. However, bias alone does not tell us
everything, and a low bias does not necessarily imply a small expected MSE.



The Direction of the Bias In some cases, the direction of the bias is very
important. Suppose we test an algorithm for various hyper-parameters and
observe that the best performance is better than some baseline. If we simply
use the highest test result, it cannot be concluded that the algorithm can really
structurally outperform the baseline for any of the specific hyper-parameters.
Although this may sound trivial, it is common in practice: when we manually
tune hyper- or meta-parameters on a problem and use the best result, we are
using $\max_i \hat\mu_i$, which has non-negative bias. It is hard to avoid
optimizing on meta-parameters: these include the very (properties of the)
problem we test the algorithm on.

The practical implication of this positive bias is that the algorithm will
disappoint in future evaluations on similar (real-world) problems. In contrast,
if we use an estimator with non-positive bias and our estimate is higher than
the baseline, we can have much more confidence that the algorithm can reach
that performance consistently with a properly tuned hyper-parameter. This is
similar to the considerations about overfitting in model selection, where CV is
most often used. We prove below that CV indeed has non-positive bias, and
can therefore avoid overestimations of $\mu_*$.

As another example, the performance of most machine-learning algorithms 
improves when more data is available. When the data collection is expensive it is 
useful to predict how an algorithm performs when more data is available, before 
actually collecting this data. An overestimation of the future performance can
lead to a misallocation of resources, since the collected data may be less useful 
than predicted. An underestimation means we may be too pessimistic, and too 
often decide not to collect more data. Whether or not the false positives are 
more important than the false negatives depends crucially on specifics of the 
setting. 

3 Estimators for the Maximum Expected Value



In this section, we discuss the ME and CV estimators for $\mu_*$ in detail. We
bound the biases and variances of all estimators, discuss similarities and
contrasts, and prove consistency. We introduce a low-variance variant of CV. We
give necessary and sufficient conditions for non-zero biases to occur for all
estimators, and perhaps surprisingly we show that there are settings in which
the negative bias of all variants of CV is larger in size than the positive bias
of the ME. All proofs are given in an appendix at the end of this paper.



3.1 The Maximum-Estimator Estimator 

The maximum-estimator (ME) estimator for $\mu_*$ is

    $$\hat\mu_*^{ME} = \max_i \hat\mu_i , \qquad (6)$$

where $\hat\mu_i$ is a (possibly biased) estimator for $\mu_i$. Because it is
conceptually simple and easy to implement, the ME estimator is often used in
practice. The theorem below proves its bias is non-negative and gives necessary
and sufficient conditions for a strictly positive bias. The theorem is stronger
and more general than some similar earlier theorems. For instance, Smith and
Winkler [2006] do not consider the possibility of multiple optimal variables,
and do not discuss necessity of the conditions for a strictly positive bias.

Theorem 1. For any given set $V$, $M \geq 1$ and unbiased estimators
$\hat\mu_i$, $E\{\hat\mu_i \mid V\} = \mu_i$,

    $$E\{\hat\mu_*^{ME} \mid V\} \geq \mu_* ,$$

with equality if and only if all optimal indices are maximal with probability one.

Theorem 1 implies a lower bound of zero for the bias of the ME. An upper
bound for arbitrary means and variances is given by Aven [1985]:

    $$\mathrm{Bias}(\hat\mu_*^{ME}) \leq \sqrt{ \frac{M-1}{M} \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i) } , \qquad (7)$$

which is tight when the estimators are iid [Arnold and Groeneveld, 1979],
indicating that iid variables are a worst-case setting.

We do not know of previous work that bounds the variance, which we discuss
next.

Theorem 2. The variance of the ME estimator is bounded by
$\mathrm{Var}(\hat\mu_*^{ME}) \leq \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i)$.

Theorem 2 and bound (7) imply that $\hat\mu_*^{ME}$ is consistent for $\mu_*$
whenever each $\hat\mu_i$ is consistent for $\mu_i$ and that
$\mathrm{MSE}(\hat\mu_*^{ME}) \leq 2 \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i)$.
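
The bias and variance bounds for the ME can be checked numerically. The sketch below is an illustration only, not part of the paper's experiments: it assumes normally distributed variables, takes each estimator to be a sample average, and uses a hypothetical helper name (`me_bias_and_bound`). It compares a Monte Carlo estimate of the bias and variance of the maximum sample average with bound (7) and the bound of Theorem 2.

```python
import numpy as np

def me_bias_and_bound(means, sigmas, n_per_var, n_trials=20000, seed=0):
    """Monte Carlo bias/variance of the ME for normal variables (an assumption),
    together with bound (7) on the bias and the Theorem 2 bound on the variance."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    m = len(means)
    # The average of n normal samples is N(mu_i, sigma_i^2 / n), so we draw
    # the sample averages directly instead of the individual samples.
    std_of_avg = sigmas / np.sqrt(n_per_var)
    averages = rng.normal(means, std_of_avg, size=(n_trials, m))
    me = averages.max(axis=1)                 # maximum sample average per trial
    bias = me.mean() - means.max()            # estimated bias of the ME
    bias_bound = np.sqrt((m - 1) / m * np.sum(std_of_avg ** 2))  # bound (7)
    var_bound = np.sum(std_of_avg ** 2)       # Theorem 2
    return bias, bias_bound, me.var(), var_bound

# Ten iid variables: the worst case for the ME according to the text.
bias, bias_bound, var, var_bound = me_bias_and_bound([0.0] * 10, [1.0] * 10, n_per_var=25)
print(f"bias={bias:.3f} (bound {bias_bound:.3f}), var={var:.4f} (bound {var_bound:.4f})")
```

With equal means every index is optimal, so any observed overestimation is pure selection bias rather than estimation error in the individual averages.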

3.2 The Cross-Validation Estimator

In general, $\mu_*$ can be considered to be a weighted average of the means of all
optimal variables:
$\mu_* = \frac{1}{|\mathcal{O}(V)|} \sum_{i=1}^{M} I(i \in \mathcal{O}(V)) \mu_i$,
where $I$ is the indicator function. We do not know $\mathcal{O}(V)$ and
$\mu_i$, but with sample sets $A$, $B$ we can approximate these with
$\mathcal{M}(A)$ and $\hat\mu_i(B)$ to obtain

    $$\mu_* \approx \hat\mu_* = \frac{1}{|\mathcal{M}(A)|} \sum_{i=1}^{M} I(i \in \mathcal{M}(A)) \hat\mu_i(B) . \qquad (8)$$

If $A = B = X$, this reduces to $\hat\mu_*^{ME}$. However, suppose that $A$ and
$B$ are independent. This idea leads to cross-validation (CV) estimators. Of
course, CV itself is not new. However, it seems to be less well-known how
properties of the problem affect the accuracy and that CV can be quite biased.

We split each $X_i$ into $K$ disjoint sets $X_i^k$ and define
$\hat\mu_i^k = \hat\mu_i(X_i^k)$. For instance, $\hat\mu_i^k$ might be the
sample average of $X_i^k$. We consider two different CV estimators. In both
methods, for each $k \in \{1, \ldots, K\}$ we construct an argument set
$\hat a^k$ and a value set $\hat v^k$.

Low-bias cross validation (LBCV) is the 'standard' CV estimator, where
$K - 1$ sets are used to build the argument set $\hat a^k$ (the model), and the
remaining set is used to determine its value:

    $$\hat a_i^k = \hat\mu_i(X_i \setminus X_i^k) \quad \text{and} \quad \hat v_i^k = \hat\mu_i(X_i^k) = \hat\mu_i^k .$$

Low-variance cross validation (LVCV) reverses the definitions for $\hat a^k$ and
$\hat v^k$:

    $$\hat a_i^k = \hat\mu_i(X_i^k) = \hat\mu_i^k \quad \text{and} \quad \hat v_i^k = \hat\mu_i(X_i \setminus X_i^k) .$$

We do not know of any previous work that discusses this variant. However,
its lower variance can sometimes result in much lower MSEs than obtained by
LBCV. For both LBCV and LVCV, if $\hat\mu_i(X)$ is the sample average of $X_i$
and all samples are unbiased, then $E\{\hat a_i^k\} = E\{\hat v_i^k\} = \mu_i$.

For either approach, $\mathcal{M}^k$ is the set of indices that maximize the
argument vector. For LBCV this implies $\mathcal{M}^k = \mathcal{M}(X \setminus X^k)$
and for LVCV this implies $\mathcal{M}^k = \mathcal{M}(X^k)$. We find the value of
these indices with the value vector, resulting in

    $$\hat\mu_*^k = \frac{1}{|\mathcal{M}^k|} \sum_{i \in \mathcal{M}^k} \hat v_i^k . \qquad (9)$$

We then average over all $K$ sets:

    $$\hat\mu_*^{CV} = \frac{1}{K} \sum_{k=1}^{K} \hat\mu_*^k = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{M}^k|} \sum_{i \in \mathcal{M}^k} \hat v_i^k , \qquad (10)$$

where either $\hat\mu_*^{CV} = \hat\mu_*^{LBCV}$ or
$\hat\mu_*^{CV} = \hat\mu_*^{LVCV}$, depending on the definitions of $\hat a^k$
and $\hat v^k$.

The construction of $\hat\mu_*^k$ performs the approximation:
$\hat v_i^k \approx \hat v_{i^*}^k \approx \mu_{i^*} = \mu_*$, where
$i \in \mathcal{M}^k$ and $i^* \in \mathcal{O}$. The first approximation results
from using $\mathcal{M}^k$ to approximate $\mathcal{O}$ and is the main source
of bias. The second approximation results from the variance of
$\hat v_{i^*}^k$.

For large enough $X$, $K$ can be treated as a parameter that trades off bias
and variance. For LBCV larger $K$ implies less bias and more variance, while
for LVCV it implies more bias and less variance. For $K = 2$, LBCV and LVCV
are equivalent. If $K > 2$, LVCV is more biased but less variable than LBCV,
since $\mathcal{M}^k$ is then based on fewer samples while $\hat v_i^k$ is based
on more samples. When $\forall i : |X_i| = K$ for LBCV, $\hat v_i^k$ is based on
a single sample, resulting in a large variance. This variant is commonly known
as leave-one-out CV. When $\forall i : |X_i| = K$ for LVCV, $\hat a_i^k$ is
based on a single sample, potentially resulting in large bias due to large
probabilities of selecting sub-optimal indices.
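
The two CV variants differ only in which part of the data plays the role of argument set and which plays the role of value set. The following sketch illustrates equations (9) and (10) under simplifying assumptions that are not from the paper: every variable has the same number of samples, each estimator is a plain sample average, folds are contiguous blocks, and the function name `cv_estimate` is hypothetical.

```python
import numpy as np

def cv_estimate(samples, K, variant="LBCV"):
    """K-fold CV estimate of the maximum expected value.

    samples: array of shape (M, N); row i holds N samples of variable V_i.
    variant: "LBCV" (argument from K-1 folds, value from the held-out fold)
             or "LVCV" (argument from one fold, value from the other K-1 folds).
    Assumes 2 <= K <= N.
    """
    samples = np.asarray(samples, dtype=float)
    M, N = samples.shape
    folds = np.array_split(np.arange(N), K)   # contiguous folds; samples assumed iid
    total = samples.sum(axis=1)
    per_fold = []
    for idx in folds:
        fold_mean = samples[:, idx].mean(axis=1)   # average over X_i^k
        rest_mean = (total - samples[:, idx].sum(axis=1)) / (N - len(idx))  # X_i \ X_i^k
        if variant == "LBCV":
            a, v = rest_mean, fold_mean
        elif variant == "LVCV":
            a, v = fold_mean, rest_mean
        else:
            raise ValueError("variant must be 'LBCV' or 'LVCV'")
        maximal = np.flatnonzero(a == a.max())     # indices maximizing the argument vector
        per_fold.append(v[maximal].mean())         # equation (9)
    return float(np.mean(per_fold))                # equation (10)

# Hypothetical usage: 10 Bernoulli variables with equal mean 0.5.
rng = np.random.default_rng(0)
clicks = rng.binomial(1, 0.5, size=(10, 1000))
print("ME  :", clicks.mean(axis=1).max())
print("LBCV:", cv_estimate(clicks, K=10, variant="LBCV"))
print("LVCV:", cv_estimate(clicks, K=10, variant="LVCV"))
```

Ties in the argument vector are detected with exact floating-point equality here, which is adequate for a sketch but may miss ties that differ only by rounding.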

The bias of CV for $\mu_*$ has received comparatively little attention.
Sometimes the bias is mentioned without explanation [Kohavi, 1995], and
sometimes it is even claimed that CV is unbiased [Mannor et al., 2007]. Often,
any observed bias is attributed to the fact that $\hat\mu_i$ can be biased when
it is based on $\frac{K-1}{K}|X|$ rather than $|X|$ samples [Varma and Simon,
2006]. This can be a factor, but the bias induced by using $\mathcal{M}^k$ for
$\mathcal{O}$ is often at least as important, as will be demonstrated below.
Some confusion seems to arise from the fact that $\hat v_i^k$ is often unbiased
for $\mu_i$. Unfortunately, this does not imply that $\hat\mu_*^{CV}$ is
unbiased for $\mu_*$.

Next, we prove that CV estimators can have a negative bias even if $\hat a$ and
$\hat v$ are unbiased, and we give necessary and sufficient conditions for a
strictly negative bias.

Theorem 3. If $E\{\hat\mu_i^k \mid V\} = \mu_i$ is unbiased then
$E\{\hat\mu_*^{CV} \mid V\} \leq \mu_*$ is negatively biased, with a strict
inequality if and only if there is a non-zero probability that any non-optimal
index is maximal.

The theorem shows that $\hat\mu_*^{LBCV}$ and $\hat\mu_*^{LVCV}$ on average
underestimate $\mu_*$ if and only if there is a non-zero probability that
$i \in \mathcal{M}^k(X)$ for some $i \notin \mathcal{O}(V)$. A prominent case in
which this does not hold is when all variables have the same mean, since then
$i \in \mathcal{O}(V)$ for all $i$. Interestingly, this implies that CV is
unbiased when the $V_i \in V$ are iid, which is a worst case for the ME. Theorem
3 implies that the bias of CV is bounded from above by zero. We conjecture the
bias is bounded from below as follows.

Conjecture 1. Let $E\{\hat\mu_i^k \mid V\} = \mu_i$. Then

    $$\mathrm{Bias}(\hat\mu_*^{CV}) \geq - \frac{1}{K} \sum_{k=1}^{K} \sqrt{ \sum_{i=1}^{M} \mathrm{Var}(\hat a_i^k) } .$$

We do not prove this conjecture here in full generality, but there is a proof
for $M = 2$ in the appendix. It makes intuitive sense that the bias of
$\hat\mu_*^{CV}$ depends only on the variances $\mathrm{Var}(\hat a_i^k)$ if
each $\hat\mu_i^k$ is unbiased. The bias of the CV estimators is unaffected by
the fact that $\hat\mu_*^{CV}$ averages over $K$ estimators $\hat\mu_*^k$, but
$K$ does affect the bias by regulating how many samples are used for each
$\hat a_i^k$. As mentioned earlier, for LVCV a larger $K$ implies a higher bias
since then $\hat a_i^k$ is more variable, while for LBCV a larger $K$ implies a
lower bias since then $\hat a_i^k$ is less variable.

Although CV is known for low bias and high variance, the next theorem
shows its absolute bias is not necessarily smaller than the absolute bias of the
ME.

Theorem 4. There exist $V$ and $N = |X|$ such that
$|\mathrm{Bias}(\hat\mu_*^{CV} \mid N)| > |\mathrm{Bias}(\hat\mu_*^{ME} \mid N)|$
for any $K$ and for any variant of CV.

Two different experiments in Section 4 prove this theorem, since there even
the negative bias of leave-one-out LBCV is larger in size than the positive bias
of the ME.

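Theorem 5. The variance of $\hat\mu_*^{LBCV}$ is bounded by

    $$\mathrm{Var}(\hat\mu_*^{LBCV}) \leq \frac{1}{K^2} \sum_{k=1}^{K} \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i^k) .$$

If each $\hat\mu_i^k$ is unbiased, the variance of LVCV is necessarily smaller
than that of LBCV and the same bound applies trivially to $\hat\mu_*^{LVCV}$.

Corollary 1. If $\hat\mu_i$ is the sample average of $X_i$ and
$|X_i^k| = |X_i|/K$ for all $k$, then
$\mathrm{Var}(\hat\mu_*^{CV}) \leq \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i)$ for
both LBCV and LVCV.

Conjecture 1 and Theorem 5 imply that CV is consistent if each $\hat\mu_i^k$ is
consistent and $K$ is fixed (or slowly increasing, see also Shao [1993]), and
that $\mathrm{MSE}(\hat\mu_*^{CV}) \leq 2 \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i)$.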

4 Concrete Illustrations 

To illustrate that it is non-trivial to select an accurate estimator, we discuss 
some concrete examples. 

4.1 Multi-Armed Bandits for Internet Ads

The framework of multi-armed bandits can be used to optimize which ad is
shown on a website [Langford et al., 2008; Strehl et al., 2010]. Consider $M$
ads with unknown fixed expected returns per visitor $\mu_i$. Bandit algorithms
can be used to balance exploration and exploitation to optimize the online
return per visitor, which converges to $\mu_*$. However, quick accurate
estimates of $\mu_*$ can be important, for instance to base future investments
on. Additionally, placing any ad may induce some cost $c$, so we may want to
know quickly whether $\mu_* > c$.

For simplicity, assume each ad has the same return per click, such that only
the click rate matters and each $V_i$ can be modeled with a Bernoulli variable
with mean $\mu_i$ and variance $(1 - \mu_i)\mu_i$. In our first experiment,
there are $N = 100,000$ visitors, $M = 10$, $M = 100$ or $M = 1000$ ads, and
$\forall i : \mu_i = 0.5$. All ads are shown equally often, such that
$\forall i : N_i = N/M$. Because all means are equal, Theorem 3 implies that CV
estimators are unbiased; their MSE depends solely on the variance. In the
second, more realistic, setting, the $M$ mean click rates are distributed evenly
between 0.02 and 0.05, there are $N = 300,000$ visitors, and $M = 30$,
$M = 300$, or $M = 3000$ ads.

Figure 1: The MSE for $\hat\mu_*^{ME}$, $\hat\mu_*^{LVCV}$ and
$\hat\mu_*^{LBCV}$ for different settings, averaged over 2,000 experiments. The
left-most bar is always $\hat\mu_*^{ME}$. The other bars are, from left to
right, leave-one-out LVCV, 10-folds LVCV, 5-folds LVCV, 2-folds CV, 5-folds
LBCV, 10-folds LBCV and leave-one-out LBCV. Note that 2-folds LVCV is equivalent
to 2-folds LBCV, which are therefore not shown separately.

The results are shown in the first four plots in Figure 1. We show the root
MSE (RMSE), such that the units are percentage points. Within the RMSEs,
the contributions of the bias and the variance are shown. Note that
$\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$, and therefore
$\mathrm{RMSE} = \sqrt{\mathrm{bias}^2 + \mathrm{variance}} \neq \mathrm{bias} + \mathrm{std\ dev}$.
This implies that the depicted contributions of bias and variance to the RMSE
are not in general exactly equal to the bias and standard deviation, but this
depiction does allow us to see directly how many percentage points of error are
caused by bias and by variance.

In the first setting (left plot) CV is indeed unbiased. Leave-one-out LVCV 
has the lowest variance of all CV methods — it is barely visible — which implies 
it has the smallest MSE. For M = 1000 ads, the huge bias of the ME causes it 
to overestimate the actual maximal click rate by more than 15%. 

In the second setting (middle three plots), there is a clear trade-off in CV:
LVCV with large $K$ has large bias and small variance, whereas LBCV with large
$K$ has small bias and large variance.^3 The bias of the CV estimators is clearly
important, even though each $\hat\mu_i^k$ is unbiased. Even for leave-one-out
LBCV the bias is non-negligible: for $M = 30$ its bias is larger than the bias
of the ME. Interestingly, when $M$ increases (and the number of samples per ad
decreases correspondingly) the error for leave-one-out LVCV stays virtually
unchanged, at approximately 1.3%. Since the error of all other estimators
increases with increasing $M$, this implies that leave-one-out LVCV goes from
being by far the least accurate for $M = 30$ to almost the most accurate for
$M = 3000$. In contrast, the ME goes from being the most accurate for $M = 30$
to the least accurate for $M = 3000$. The reason is that for increasing $M$, the
variables are relatively more similar to iid variables, which is a best case for
LVCV and a worst case for the ME. In all three cases, 10- and 5-folds LVCV are a
good choice.

^3 Sometimes the bias of LBCV seems to increase slightly for higher $K$. These
are noise-related artifacts.

4.2 Evaluation of Algorithms 

We now consider a regression problem. The goal is to fit polynomials on noisy
samples from a function $r(y) = 4(\sin(y) + \sin(2y))$. Let
$X = \{(y, r(y) + \omega) \mid y \in Y\}$ denote a noisy data set for inputs
$Y$, where $\omega$ is zero-mean Gaussian noise with variance $\sigma^2 = 4$.
Let $p_i$ denote a polynomial of degree $i$, of which the coefficients are
fitted with least-squares on $X$.

Let $Y = \{0, 0.05, \ldots, 3.95, 4\}$ be 81 equidistant inputs. We want to
maximize the negative MSE. The lowest expected MSE of fitting each $p_i$ on 81
samples and testing on an independent test set of 81 samples is obtained at 4.34
for $i = 5$, which implies $\mu_* = \mu_5 = -4.34$. We construct 1,000
independent noisy sets $X = \{(y, r(y) + \omega) \mid y \in Y\}$. For each $X$,
we conduct the following experiment.

For any given $Z \subset X$, $\hat\mu_i$ is defined by an inner CV loop as
follows. For each $z \in Z$, we fit $p_i$ on $Z \setminus \{z\}$ and test the
error on $z$ to obtain an error $e_i(z)$. We average these errors to obtain:
$\hat\mu_i(Z) = -\frac{1}{|Z|} \sum_{z \in Z} e_i(z)$. This implies $\hat\mu_i$
is biased, since $p_i$ is fitted on $|Z| - 1 < 81$ samples. For the ME,
$\hat\mu_*^{ME} = \max_i \hat\mu_i(X)$, which means $|Z| - 1 = 80$ samples are
used to fit each $p_i$. For LBCV, $\hat a_i^k = \hat\mu_i(X \setminus X^k)$,
which means $|Z| = \frac{K-1}{K} 81$. For LVCV, $\hat a_i^k = \hat\mu_i(X^k)$,
which means $|Z| = \frac{1}{K} 81$. Since $|Z|$ can then be much smaller than
81, LVCV can be significantly biased. We consider $K \in \{2, 3, 9, 81\}$. When
$K = 81$, LBCV is also known as nested leave-one-out CV. Figure 1 (right plot)
shows the results.
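
The inner leave-one-out loop that defines each estimator can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: the error $e_i(z)$ is taken to be the squared prediction error, the set of candidate degrees (1 to 8) is hypothetical because the text does not list it, and `loo_neg_mse` is a made-up helper name.

```python
import numpy as np

def r(y):
    """Target function from the text: r(y) = 4 (sin(y) + sin(2y))."""
    return 4.0 * (np.sin(y) + np.sin(2.0 * y))

def loo_neg_mse(y, t, degree):
    """Inner leave-one-out estimate of the negative MSE of a polynomial fit.

    For each point, fit a least-squares polynomial of the given degree on all
    other points and record the squared error (an assumed error measure) on
    the held-out point.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    errors = []
    for j in range(len(y)):
        keep = np.arange(len(y)) != j
        coeffs = np.polyfit(y[keep], t[keep], degree)
        errors.append((np.polyval(coeffs, y[j]) - t[j]) ** 2)
    return -float(np.mean(errors))

rng = np.random.default_rng(0)
Y = np.linspace(0.0, 4.0, 81)                          # 81 equidistant inputs
targets = r(Y) + rng.normal(0.0, 2.0, size=Y.shape)    # noise variance 4, so std 2
degrees = range(1, 9)                                  # hypothetical candidate set
mu_hat = {i: loo_neg_mse(Y, targets, i) for i in degrees}
me_estimate = max(mu_hat.values())                     # ME: best inner-CV score
print("ME estimate of the negative MSE:", me_estimate)
```

For LBCV or LVCV the same helper would be called on the argument and value subsets of the data instead of on the full set.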

LVCV is not shown for $K = 81$ and $K = 9$: LVCV with $K = 81$ is meaningless,
since one cannot fit a polynomial on a single point. The MSE for $K = 9$ is
huge. In sharp contrast with the previous settings, LVCV fares poorly (even in
terms of variance) and leave-one-out LBCV is the best CV estimator. However,
interestingly the ME is more accurate than all CV estimators, and even the size
of its bias (0.018) is much smaller than that of n-fold LBCV (-0.190).

5 Discussion 

Our results show that it is hard to choose an estimator that is good in general. 
Unfortunately, the best choice in one setting can be the worst choice in another. 
A poorly chosen CV estimator can be far less accurate than the ME. This does 
not imply that we suggest using the ME; it is often very biased. 

A potential advantage of CV estimators in some settings is a guaranteed
non-positive bias. This can be desirable even if the estimator is less
accurate. However, in our results the recommendation to always use 10-folds LBCV
[Kohavi, 1995] seems unfounded. When each $\hat\mu_i$ is unbiased and especially
when $M$ is large, LVCV often performs much better. On the other hand, when each
estimator $\hat\mu_i$ has a bias that decreases with the number of samples, the
bias of LVCV can become prohibitively large, as illustrated in the regression
setting. This explains why 10-folds LBCV is often not a bad choice for model
selection, as long as $M$ is fairly small and $\hat\mu_i$ is fairly biased.
However, note that 5- and 10-folds LBCV were the most accurate estimator in none
of our experiments.

As a general recommendation, it may be good to try both the ME and one
or more CV estimators. If the estimates are close together, this indicates they
are more likely to be accurate. Although the true maximum expected value
will often lie between the estimate by the ME and those by CV, one should not
simply average these estimates: as we have shown, the ME for instance can
be very biased in some settings, and hardly biased in others. Furthermore, the
potentially excessive variance of some variants of LBCV implies that in some
cases its estimate may itself be an overestimation, which is why we recommend
including LVCV in the analysis.

Alternative estimators Of course, there are possible alternatives to the
estimators we discussed. First, one can consider using the maximum of some
lower confidence bounds on the individual value estimates. Although this does
counter the overestimation of the ME, it cannot be guaranteed that this does not
lead to an underestimation in its place. Furthermore, it is non-trivial to select
a good confidence interval, and the resulting estimate will typically be much
more variable than the ME.

Second, for model selection there exist criteria such as AIC [Akaike, 1974]
and BIC [Schwarz, 1978] that use a penalty term based on the number of
parameters in the model. Obviously, such penalties are only useful when comparing
homomorphic models with different numbers of parameters, and therefore do
not apply to the more general setting we consider in this paper. Furthermore,
the main purpose of these criteria is not to give an accurate estimate of the
expected value of the best model, but to increase the probability of selecting it.
These goals are related, but unfortunately not equivalent.

Finally, one can estimate belief distributions $F_i$ for the location of each
$\mu_i$, for instance with Bayesian inference. With these distributions, we can
estimate $\mu_*$. This approach is less general, since it requires prior
knowledge about $V$, but then it does seem reasonable. The probability that the
maximum mean is smaller than some $x$ is equal to the probability that all means
are smaller than $x$. Therefore, its CDF is
$F_{\max}(x) = \prod_{i=1}^{M} F_i(x)$, which we can use to estimate $\mu_*$.
The resulting Bayesian estimator (BE) is

    $$\hat\mu_*^{BE} = \int_{-\infty}^{\infty} x \sum_{i=1}^{M} f_i(x) \prod_{j \neq i} F_j(x) \, dx ,$$

where $f_i(x) = \frac{d}{dx} F_i(x)$. To show a perhaps counter-intuitive result
from this approach, we discuss a small example. Consider two Bernoulli
variables. We consider all means equally likely and use a uniform prior Beta
distribution, with parameters $\alpha = \beta = 1$. Suppose
$\mu_1 = \mu_2 = 0.5$. We draw two samples from each variable. The expected
estimate for the ME is $\frac{21}{32}$, for a bias of $\approx 0.156$. CV is
unbiased, since the means are equal. For the BE, $F_i(x)$ is $1 - (1 - x)^3$,
$3x^2 - 2x^3$ or $x^3$, depending on how many samples for $V_i$ are equal to
one. Its expected value is then $E\{\hat\mu_*^{BE}\} \approx 0.658$. Note that
the positive bias is even higher than the bias of the ME. This is due to our
uniform prior: if the prior on the individual variables is uniform, this implies
the prior for the maximum expected value is negatively skewed, and its expected
value is increased. The effect is already apparent with two variables, but it
increases further with the number of variables due to the shape of
$F_{\max}(x) = \prod_{i=1}^{M} F_i(x)$.
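
The expected value of the BE in this two-variable example can be computed exactly, because each posterior CDF is a cubic polynomial on $[0, 1]$. The sketch below is an illustration with hypothetical helper names (`poly_mul`, `integral_01`, `be_estimate`, `prob_ones`); it uses the identity $E\{\max\} = 1 - \int_0^1 F_{\max}(x)\,dx$ for a variable supported on $[0, 1]$ and averages over all possible sample outcomes.

```python
from fractions import Fraction
from math import comb

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (increasing powers)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def integral_01(p):
    """Integral over [0, 1] of the polynomial with coefficients p."""
    return sum(c / (n + 1) for n, c in enumerate(p))

# Posterior CDFs after two samples with a uniform Beta(1, 1) prior, indexed by
# the number of samples equal to one; coefficients in increasing powers of x.
CDF = {
    0: [Fraction(0), Fraction(3), Fraction(-3), Fraction(1)],  # 1 - (1 - x)^3
    1: [Fraction(0), Fraction(0), Fraction(3), Fraction(-2)],  # 3x^2 - 2x^3
    2: [Fraction(0), Fraction(0), Fraction(0), Fraction(1)],   # x^3
}

def be_estimate(k1, k2):
    """BE for two variables: mean of the distribution with CDF F_1(x) F_2(x)."""
    return 1 - integral_01(poly_mul(CDF[k1], CDF[k2]))

def prob_ones(k, n=2, mu=Fraction(1, 2)):
    """Probability of observing k ones in n Bernoulli(mu) samples."""
    return comb(n, k) * mu ** k * (1 - mu) ** (n - k)

expected_be = sum(
    prob_ones(k1) * prob_ones(k2) * be_estimate(k1, k2)
    for k1 in range(3)
    for k2 in range(3)
)
print(expected_be, float(expected_be))  # agrees with the value of about 0.658 in the text
```

Exact rational arithmetic avoids any doubt about rounding when comparing the resulting bias with that of the other estimators.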



6 Conclusion 

We analyzed the bias and variance of the two most common estimators for the
maximum expected value of a set of random variables. The maximum estimator
results in non-negative bias. The common alternative of cross validation (CV)
has non-positive bias, which can be preferable. Unfortunately, the accuracies of
different variants of CV are very dependent on the setting; an uninformed choice
can result in extremely inaccurate estimates. No general rule (e.g., always use
10-fold CV) is always optimal.

Appendix 

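Proof of Theorem 1. For conciseness, we leave $V$ and $X$ implicit. Let
$j \in \mathcal{O}$ be an arbitrary optimal index, and define event
$A_j = (j \in \mathcal{M})$ to be true if and only if $j$ is maximal. We can
write

    $$\mu_* = E\{\hat\mu_j\} = P(A_j) E\{\hat\mu_j \mid A_j\} + P(\neg A_j) E\{\hat\mu_j \mid \neg A_j\} .$$

Note: $E\{\hat\mu_j \mid A_j\} = E\{\hat\mu_*^{ME} \mid A_j\}$ and
$E\{\hat\mu_j \mid \neg A_j\} < E\{\hat\mu_*^{ME} \mid \neg A_j\}$. Therefore,
$\mu_* \leq E\{\hat\mu_*^{ME}\}$, with equality if and only if
$P(\neg A_j) = 0$ for all $j \in \mathcal{O}$. □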

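Proof of Theorem 2. Let $A$ and $B$ be independent sets of RVs with
$E\{A_i\} = E\{B_i\}$ and $E\{A_i^2\} = E\{B_i^2\}$. Define

    $$C^{(i)} = (A \setminus A_i) \cup \{B_i\} = \{A_1, \ldots, B_i, \ldots, A_M\} .$$

The Efron-Stein inequality [Efron and Stein, 1981] states that for any
$g : \mathbb{R}^M \to \mathbb{R}$

    $$\mathrm{Var}(g(A)) \leq \frac{1}{2} \sum_{i=1}^{M} E\left\{ \left( g(A) - g(C^{(i)}) \right)^2 \right\} .$$

Let $A$ and $B$ be independent instantiations of $\hat\mu$ and let
$g(A) = \max_i A_i$ for any $A$. We derive

    $$\mathrm{Var}(\hat\mu_*^{ME}) \leq \frac{1}{2} \sum_{i=1}^{M} E\left\{ \left( \max_j A_j - \max_j C_j^{(i)} \right)^2 \right\}
      \leq \frac{1}{2} \sum_{i=1}^{M} E\left\{ (A_i - B_i)^2 \right\} = \sum_{i=1}^{M} \mathrm{Var}(\hat\mu_i) . \quad □$$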



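Proof of Theorem 3. Let
$w_i^k = E\{ I(i \in \mathcal{M}^k) / |\mathcal{M}^k| \}$. Then
$E\{\hat\mu_*^k\} = \sum_{i=1}^{M} w_i^k \mu_i \leq \mu_*$, because
$\mathcal{M}^k$ and $\hat v^k$ are independent, $\sum_i w_i^k = 1$ and
$E\{\hat v_i^k\} = \mu_i$. Note that $w_i^k > 0$ if and only if
$P(i \in \mathcal{M}^k) > 0$. Therefore, $E\{\hat\mu_*^k\} < \mu_*$ if and only
if there exists an $i \notin \mathcal{O}$ such that
$P(i \in \mathcal{M}^k) > 0$. □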

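Proof of Conjecture 1 for $M = 2$. Assume without loss of generality that
$\mu_1 = \mu_*$. The assumption $E\{\hat\mu_i^k\} = \mu_i$ implies that
$E\{\hat v_i^k\} = \mu_i$. Then,

    $$E\{\hat\mu_*^k\} - \mu_* \geq P(2 \in \mathcal{M}^k) (\mu_2 - \mu_1)$$
    $$= P(\hat a_2^k \geq \hat a_1^k) (\mu_2 - \mu_1)$$
    $$\geq \frac{ \left( \mathrm{Var}(\hat a_1^k) + \mathrm{Var}(\hat a_2^k) \right) (\mu_2 - \mu_1) }{ \mathrm{Var}(\hat a_1^k) + \mathrm{Var}(\hat a_2^k) + (\mu_1 - \mu_2)^2 }$$
    $$\geq - \frac{1}{2} \sqrt{ \mathrm{Var}(\hat a_1^k) + \mathrm{Var}(\hat a_2^k) } ,$$

where the second inequality follows from Cantelli's inequality, and the third
inequality is the result of minimizing for $\mu_2 \leq \mu_1$. From this, it
follows that for $M = 2$

    $$\mathrm{Bias}(\hat\mu_*^{CV}) \geq - \frac{1}{2} \frac{1}{K} \sum_{k=1}^{K} \sqrt{ \sum_{i=1}^{M} \mathrm{Var}(\hat a_i^k) } ,$$

which is a factor $\frac{1}{2}$ tighter than the general bound in the conjecture. □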

Proof of Theorem]^ We apply definition ([TOl) and use X^fcLi — Y^k=i A? to 
derive 

\ fe=l ' ' ieM>' I \ fc=l i=\ / 

K M 



K M 
fe=l i=l 



Proof of Corollary 1. Apply Theorem 5 with
$\mathrm{Var}(\hat\mu_i^k) = \sigma_i^2 / |X_i^k| = K \sigma_i^2 / |X_i| =$